Abstract 

Background: Current epigenetic research makes frequent use of whole-genome ChIP profiling for determining the 
in vivo binding of proteins, e.g. transcription factors and histones, to DNA. Two important and recurrent questions for 
these large scale analyses are: 1) What is the genomic distribution of a set of binding sites? and 2) Does this genomic 
distribution differ significantly from another set of sites? 

Findings: We exemplify the functionality of the PinkThing by analysing a ChIP profiling dataset of cohesin binding 
sites. We show that the subset of cohesin sites with no CTCF binding has a characteristic genomic distribution that 
differs from that of the set of all cohesin sites. 

Conclusions: The PinkThing is a web application for fast and easy analysis of the context of genomic loci, such as 
peaks from ChIP profiling experiments. The output of the PinkThing analysis includes: categorisation of position 
relative to genes (intronic, exonic, 5' near, 3' near, 5' far, 3' far and distant), distance to the closest annotated 3' and 5' 
end of genes, direction of transcription of the nearest gene, and the option to include other genomic elements like 
ESTs and CpG islands. The PinkThing enables easy statistical comparison between experiments, i.e. experimental 
versus background sets, reporting over- and underrepresentation as well as p-values for all comparisons. Access and 
use of the PinkThing is free and open (without registration) to all users via the website: http://pinkthing.cmbi.ru.nl 



Findings 

Background 

Transcription factor (TF) binding sites can be identified 
in vivo using the emerging technologies for ChIP profiling 
such as ChIP-on-chip and ChIP-sequencing. These 
experiments locate hundreds to tens of thousands of TF 
binding sites, which subsequently have to be validated and 
analysed for biological function. Certain initial analyses of TF 
binding sites have already become commonplace: map- 
ping of the binding sites to the genome, detecting closest 
genes to the binding sites, categorising binding sites by 
their position relative to the genes, and the subsequent 
GO analysis of the genes closest to the binding sites. 
The same type of analyses apply to other regions found 
by ChIP profiling, e.g. regions with specific histone 
modification patterns or DNA hyper- or hypomethylated loci. The 
PinkThing gathers all these analyses into one single 
user-friendly tool, standardises the characterisation of genomic 
locations, uses up-to-date Ensembl gene annotation, 
and at the same time enables statistically sound compar- 
isons at each step of the analysis. Although it is possible to 
perform similar analysis using bioinformatic packages like 
Taverna [1] or Galaxy [2], these packages require instal- 
lation, initialisation and scripting (Taverna) or a certain 
level of statistics skills from the user (e.g. Galaxy). Another 
more specialised tool is the HyperBrowser [3], which is 
based on Galaxy and designed to incorporate functions 
to query and correlate annotation along the genome. The 
versatility of this tool is impressive, but it comes at the 
cost of usability. Compared to the PinkThing, the Hyper- 
Browser has a complex interface that takes time to learn 
to use efficiently. A more accessible tool is GREAT [4] 
which provides a web interface for analysis, with a focus 
specifically on cis-regulatory regions for human, mouse 
and zebrafish. 
With the PinkThing tool the genomic analyses with 
statistics are available and ready-to-use directly from the 
website by a simple upload of a file containing the coor- 
dinates of genomic regions of interest. In addition, the 
PinkThing provides the option of supplying an appropri- 
ate background distribution to be used for comparisons, 
e.g. the set of all binding sites of a ChIP-seq experiment 
when examining a subset of these sites. Furthermore, the 
PinkThing analysis of genomic distributions supports all 
species that are annotated in Ensembl. PinkThing has 
already been successfully used for the analysis of ChIP 
profiling data in a wide variety of studies [5-17]. 

Approach 

Through the PinkThing web page, the user uploads the 
genomic sites of interest using a standard format for 
genomic locations (BED format). PinkThing compares 
the sites to the Ensembl gene annotation and optionally 
to Ensembl CpG islands, ESTs and regulatory features. 
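
As an illustration of the input, the following minimal Python sketch reads sites in BED format. It is not the PinkThing's own code (which is implemented in Perl and R); the function names `parse_bed` and `site_midpoint` and the example coordinates are hypothetical. It relies only on the standard BED convention that the first three columns hold the chromosome, a 0-based start and an exclusive end.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    BED coordinates are 0-based and half-open: start is inclusive,
    end is exclusive. Header and comment lines are skipped.
    """
    sites = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 columns: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"BED end must be greater than start: {line!r}")
        sites.append((chrom, start, end))
    return sites


def site_midpoint(site):
    """Return the midpoint of a site (assumed here to represent the locus)."""
    _, start, end = site
    return (start + end) // 2


# Illustrative input only; these coordinates are made up.
example = [
    "track name=example",
    "chr1\t100\t200\tpeak1",
    "chr2 5000 5600",
]
for site in parse_bed(example):
    print(site, site_midpoint(site))
```

Representing each site by its midpoint is one possible choice for the later distance calculations; the source does not state which reference point the PinkThing uses.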



The initial results include basic statistics of the genomic 
distribution of the uploaded positions: 

• histograms of distances to the 5' and the 3' ends of 
genes, 

• a barplot and a pie chart of the genomic distribution 
relative to Ensembl genes/CpG islands/ESTs 
(Figure 1a), 

• a barplot of the genomic distribution. 

Every output plot is clickable for download of an SVG 
(Scalable Vector Graphics) version of the plot for publica- 
tion. Subsequent analyses can be accessed from the results 
page and include: 

• transcription start sites (TSS) within the surrounding 
regions, 

• the conservation score (GERP [18]) for these regions, 

• comparisons to other sets of genomic regions. 

Figure 1 PinkThing produces graphical output along with the statistics of the genomic distributions. Three plots and a table produced by 
the PinkThing comparison of cohesin sites (background) versus cohesin-no-CTCF sites (foreground). a) PinkThing pie chart showing the overall 
genomic distribution of cohesin-no-CTCF sites directly upon upload. b) The output of the PinkThing comparison illustrating the category sizes by 
ratio of foreground over background, indicating significance by stars as indicated in Figure 1c. c) Table of statistics for cohesin-no-CTCF (foreground) 
versus all cohesin sites (background) with the G-test p-value for each genomic category. d) The barplot of category frequencies of the compared 
distributions, showing the relative frequencies side-by-side. For this analysis the 'near' limit was set to 1000 bp and the 'far' limit to 3000 bp. 

• retrieval of the reference DNA sequence for the 
uploaded regions, 

• GO annotation and GO enrichment analysis of the 
nearest genes (only available for human and mouse). 

As an example, we considered the CTCF (CCCTC-binding 
factor) and cohesin binding sites as determined using 
ChIP-seq in a study by Schmidt et al. [19]. In this study, 
Schmidt et al. also discovered a subset of cohesin bind- 
ing sites that did not overlap with CTCF binding sites. 
Using the PinkThing we compared the genomic distribu- 
tion of the set of cohesin sites with the set of cohesin 
sites not bound by CTCF (cohesin-no-CTCF sites). We 
chose the cutoffs for the 'near' and 'far' categories in our 
analysis to be 1000 bp and 3000 bp respectively, to match the 
categories presented in their paper [19], and examined the 
PinkThing-determined categorisation and genomic 
distribution of the sites (Figure 1a). The categorisation of the 
subset cohesin-no-CTCF is very informative when viewed 
in relation to the background set of all cohesin binding 
sites: the PinkThing provides the relative sizes of each 
of the categories for the two sets of sites (Figure 1b), 
the Brandt-Snedecor test statistic for comparison of the 
two distributions, p-values for whether the 
differences in distributions are significant per category 
(Figure 1c), and a barplot showing the frequencies of 
the two distributions grouped per category (Figure 1d). 

The Brandt-Snedecor test indicates that the genomic 
distributions of cohesin and cohesin-no-CTCF sites are 
significantly different (p < 2.2e-16), and the individual 
category comparison shows that the cohesin-no-CTCF 
sites are significantly overrepresented in promoter 
regions (5' near) and exon regions, and underrepresented 
in 3' regions, as compared to the set of all 
cohesin sites (Figure 1c). The Gene Ontology (GO) 
result from the PinkThing (Additional file 1: 
noCTCF-vs-cohesin-GO.xls) indicates that the cohesin-no-CTCF binding 
sites are, relative to all cohesin sites, overrepresented 
(adjusted p < 10e-3) in the following GO categories: 
nucleic acid binding (GO:0003676), transcription 
regulator activity (GO:0030528) and regulation of metabolic 
process (GO:0019222). The difference in genomic 
distributions suggests a difference in function, which 
was confirmed by Schmidt et al., who showed that the 
cohesin-no-CTCF sites are enriched for DNA binding 
sites of tissue-specific transcription factors and further 
explored this fact in a follow-up study [20]. 

Methods 
Database access 

Lookup and data retrieval from the Ensembl database 
was implemented using the Ensembl Perl API 
(http://www.ensembl.org/info/docs/api/core/core_tutorial.html). 



Statistics 

Statistical tests and generation of plots and figures on the 
PinkThing website are implemented in R [21] and BioPerl 
[22]. A detailed description of the plots and their interpretation 
is given in the PinkThing manual, available from the PinkThing 
website. 

Categorising positions 

For each imported locus, PinkThing determines the dis- 
tance to the closest gene. Positions that fall into multiple 
categories are assigned to the highest ranking category 
in the order: exon, intron, 5' near, 3' near, 5' far, 3' far 
and distant. As the surrounding sequence may contain 
overlapping gene annotations, the gene annotations are 
merged such that each position corresponds to a unique 
genome category. The limits for "near" and "far" may be 
chosen by the user, with default values being 5 kb and 25 kb, 
respectively. 
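
The ranking rule can be sketched in Python as follows. This is a hedged illustration rather than the PinkThing's implementation: the `Gene` class, the function names, the treatment of a site as a single position, and the exact boundary handling (a distance equal to the limit counts as inside it) are all assumptions. Here "5'" means upstream of the gene on its own strand and "3'" means downstream.

```python
from dataclasses import dataclass, field

# Categories from highest to lowest rank, in the order given in the text.
RANKING = ["exon", "intron", "5' near", "3' near", "5' far", "3' far", "distant"]


@dataclass
class Gene:
    start: int   # leftmost coordinate, inclusive
    end: int     # rightmost coordinate, inclusive
    strand: int  # +1 or -1
    exons: list = field(default_factory=list)  # (start, end) pairs, inclusive


def category_for_gene(pos, gene, near=5000, far=25000):
    """Category of a position relative to one gene."""
    if gene.start <= pos <= gene.end:
        in_exon = any(s <= pos <= e for s, e in gene.exons)
        return "exon" if in_exon else "intron"
    left_of_gene = pos < gene.start
    dist = gene.start - pos if left_of_gene else pos - gene.end
    # Left of a plus-strand gene (or right of a minus-strand gene) is the 5' side.
    side = "5'" if left_of_gene == (gene.strand > 0) else "3'"
    if dist <= near:
        return f"{side} near"
    if dist <= far:
        return f"{side} far"
    return "distant"


def categorise(pos, genes, near=5000, far=25000):
    """Assign exactly one category: the highest ranking over all genes."""
    best = "distant"
    for gene in genes:
        cat = category_for_gene(pos, gene, near, far)
        if RANKING.index(cat) < RANKING.index(best):
            best = cat
    return best


# Made-up gene models on one chromosome, for illustration only.
genes = [
    Gene(10000, 20000, +1, [(10000, 10500), (19000, 20000)]),
    Gene(30000, 40000, -1, [(30000, 31000)]),
]
print(categorise(9000, genes, near=1000, far=3000))
```

Taking the best rank over all nearby genes is one way to guarantee a unique category per position; the merging of overlapping annotations described above achieves the same goal.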

Comparison of category distributions 

The initial statistical test is the Brandt-Snedecor test for 
homogeneity of k binomial distributions [23]. The test 
assumes i) that the observations are independent; ii) that 
the set chosen as background distribution is at least twice 
the size of the set chosen as sample distribution; and iii) 
that the k categories are exhaustive and mutually 
exclusive. Our implementation of the Brandt-Snedecor test 
does not test whether there is an overlap of actual observa- 
tions in the two sets of the comparison. The test compares 
the distribution of observations over categories, indiffer- 
ent to the identity of the observation. In our example 
above we show how we apply the comparison between 
a set of sites (cohesin binding sites) and a subset of 
those sites (cohesin-no-CTCF sites), but the test can also 
be applied to compare two disparate sets of sites in the 
genome. 

Let $n_1$ and $n_2$ be the total counts for each of the 
distributions 1 and 2, with $n = n_1 + n_2$, and let $n_{1i}$ and $n_{2i}$ be 
the counts within category $i$. The test statistic is then given 
by Equation 1, and the p-value is obtained by comparing it to 
a $\chi^2_{k-1}$ distribution. If the distributions are identical, the 
test statistic is 0 and the resulting p-value is 1. 

$$\chi^2 = \frac{n^2}{n_1 \times n_2} \left( \sum_{i=1}^{k} \frac{n_{1i}^2}{n_{1i} + n_{2i}} - \frac{n_1^2}{n} \right) \qquad (1)$$
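
A small numerical sketch of this test in Python may help. It assumes the standard Brandt-Snedecor form, in which the sum over categories of the squared count in distribution 1 divided by the combined count, minus n1 squared over n, is scaled by n squared over (n1 times n2). The function names are hypothetical, and `chi2_sf` is a helper written for this sketch that uses the closed-form series for integer degrees of freedom so that only the standard library is needed. The PinkThing itself performs its statistics in R.

```python
import math


def brandt_snedecor(counts1, counts2):
    """Return (statistic, degrees of freedom) for two per-category count lists."""
    if len(counts1) != len(counts2):
        raise ValueError("both distributions need the same categories")
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    total = 0.0
    for a, b in zip(counts1, counts2):
        if a + b > 0:  # empty categories contribute nothing
            total += a * a / (a + b)
    stat = n * n / (n1 * n2) * (total - n1 * n1 / n)
    return stat, len(counts1) - 1


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_r z^(2r-1) / (1*3*...*(2r-1))
    z = math.sqrt(x)
    term, total = z, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(z / math.sqrt(2.0)) + 2.0 * density * total


# Made-up counts for seven categories, for illustration only.
foreground = [120, 300, 90, 40, 60, 50, 340]
background = [400, 1500, 200, 220, 300, 280, 2100]
stat, df = brandt_snedecor(foreground, background)
print(stat, df, chi2_sf(stat, df))
```

For two distributions this statistic coincides with the Pearson chi-square statistic of the corresponding 2 x k contingency table, which offers a convenient cross-check against standard statistical software.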



To determine whether each of the genomic categories is 
significantly overrepresented, we apply the log likelihood 
G-test for independence [24] per category, comparing 
each category against the union of the other categories. 
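
The per-category comparison can be illustrated with the Python sketch below, which builds the 2 x 2 table (foreground or background, inside or outside the category) and computes the log-likelihood ratio statistic G. The function name and the example counts are hypothetical, and the two sets are treated as independent samples, which is a simplifying assumption. With one degree of freedom, the chi-square tail probability equals erfc(sqrt(G/2)), so no external library is required.

```python
import math


def g_test_category(fg_in, fg_total, bg_in, bg_total):
    """G-test of one category against the union of all other categories.

    Returns (G, p_value, ratio), where ratio is the foreground frequency
    divided by the background frequency (above 1 means overrepresented).
    """
    observed = [
        [fg_in, fg_total - fg_in],
        [bg_in, bg_total - bg_in],
    ]
    n = fg_total + bg_total
    row_sums = [fg_total, bg_total]
    col_sums = [fg_in + bg_in, n - fg_in - bg_in]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to G
                expected = row_sums[i] * col_sums[j] / n
                g += 2.0 * o * math.log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding errors
    p_value = math.erfc(math.sqrt(g / 2.0))  # chi-square tail, 1 df
    ratio = (fg_in / fg_total) / (bg_in / bg_total) if bg_in > 0 else math.inf
    return g, p_value, ratio


# Made-up counts: 30 of 100 foreground sites and 20 of 200 background
# sites fall into the category of interest.
print(g_test_category(30, 100, 20, 200))
```

Reporting the frequency ratio alongside the p-value indicates the direction of the effect, i.e. whether a significant category is over- or underrepresented.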

GO analysis 

The PinkThing uses Ontologizer [25] to perform GO 
statistics on the set of genes that are closest to the 
uploaded set of sites. The output includes both the 
graphical Ontologizer visualization of overrepresented 
categories and a table with p-values of all terms. The 
GO statistics can either be calculated against the whole 
genome as background or against the genes 
corresponding to another uploaded set. Currently the GO 
analysis is available for human and mouse. 

Conclusion 

Determination of the annotation context of genomic loci 
is an indispensable foundation for their functional anal- 
ysis. To allow distributions of genomic categories to be 
compared within the same genome assembly, the imple- 
mentation of categorisation in PinkThing consistently 
assigns exactly one category to any genomic position, thus 
providing consistent and reproducible analysis. The Pink- 
Thing categorisation into consistent genomic categories 
allows the comparison with a background or reference 
distribution. This comparison is essential when examin- 
ing distributions where the categories vary in size and 
abundance dependent on the context of the genome (the 
species) and the experiment (e.g. selecting for specific 
genomic elements). PinkThing makes it easy to com- 
pare results from a specific experiment with a chosen 
background distribution, thus increasing the value and 
confidence in interpreting results. 

PinkThing is a collection of the most common genomic 
analyses related to genomic context, combined in a simple 
point and click web interface. With no prerequisites other 
than obtaining a dataset, PinkThing provides easy access 
to sound statistical analysis of genomic location data. 

Availability and requirements 

Access and use of the PinkThing is free and open (without 
registration) to all users via the website: 
http://pinkthing.cmbi.ru.nl 

• Project name: The PinkThing for analysing ChIP 
profiling data in their genomic context 

• Project home page: 
http://www.bioinformatics.org/websvn/listing.php?repname=pinkthing 

• Operating systems: Usage of the web tool is platform 
independent; access is available via all major web 
browsers at http://pinkthing.cmbi.ru.nl 

• Code repository (SVN): 
http://www.bioinformatics.org/websvn/listing.php?repname=pinkthing 

• License: GNU GPL 

The two data sets with cohesin binding sites with and 
without CTCF are available for download from the front 
page of the PinkThing and included with the additional 
files for this article (Additional file 2: all cohesin binding 
sites, cohesin.bed; Additional file 3: cohesin binding sites 
without CTCF, no_CTCF_cohesin.bed). 



To reproduce the results in this paper: 

1. Go to http://pinkthing.cmbi.ru.nl and select Ensembl 
version 53. 

2. Then enter the desired cutoff for near and far 
positions as 1000 and 3000 respectively. 

3. Upload the two files one at a time by selecting the file 
location and clicking 'upload and calculate'. 

4. To perform the comparison, from the start page of 
the PinkThing, select the two uploaded files in the 
box 'Compare genomic distributions', choosing the 
cohesin sites with no CTCF binding as the sample 
file (foreground) and choosing the set of all cohesin 
binding sites as sample space (background). 

5. Click 'Compare' to view the results. 

6. To find overrepresented ontologies, from the start 
page of the PinkThing, select the two uploaded files 
in the respective boxes for sample space (all cohesin 
binding sites) and sample file (cohesin binding sites 
without CTCF) and click 'Ontologize'. 

Additional files 



Additional file 1: noCTCF-vs-cohesin-GO.xls. The Ontologizer result of 
comparing the cohesin sites without CTCF (the study set) against the 
background set of all cohesin sites. 

Additional file 2: cohesin.bed. Cohesin binding sites in BED format, as 
obtained from Schmidt et al. [19]. 

Additional file 3: no_CTCF_cohesin.bed. Sites of cohesin binding with 
no CTCF binding, as obtained from Schmidt et al. [19]. 